Using Random Effects to Account for High-Cardinality Categorical Features and Repeated Measures in Deep Neural Networks

Neural Information Processing Systems

High-cardinality categorical features are a major challenge for machine learning methods in general and for deep learning in particular. Existing solutions such as one-hot encoding and entity embeddings can be hard to scale when the cardinality is very high, require much space, are hard to interpret, or may overfit the data. A special scenario of interest is that of repeated measures, where the categorical feature is the identity of the individual or object, and each object is measured several times, possibly under different conditions (values of the other features). We propose accounting for high-cardinality categorical features as random-effects variables in a regression setting, and consequently adopt the corresponding negative log-likelihood loss from the linear mixed models (LMM) statistical literature and integrate it into a deep learning framework. We test our model, which we call LMMNN, on simulated as well as real datasets with a single high-cardinality categorical feature, using various baseline neural network architectures such as convolutional networks and LSTMs, and various applications in e-commerce, healthcare, and computer vision. Our results show that treating high-cardinality categorical features as random effects leads to a significant improvement in prediction performance compared to state-of-the-art alternatives. Potential extensions such as accounting for multiple categorical features and classification settings are discussed. Our code and simulations are available at https://github.com/gsimchoni/lmmnn.
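For a random-intercepts model y = f(X) + Zb + e with b ~ N(0, sig2b * I) and e ~ N(0, sig2e * I), the marginal covariance of y is V = sig2e * I + sig2b * Z Z', and the LMM negative log-likelihood has the standard closed form. The following is a minimal NumPy sketch of that loss, not the authors' LMMNN implementation (which plugs a network's output in for f(X) and is available in the lmmnn repository); all names here are illustrative.

```python
import numpy as np

def lmm_nll(y, f_X, Z, sig2e, sig2b):
    """Negative log-likelihood of the marginal LMM: y ~ N(f(X), V),
    V = sig2e * I + sig2b * Z Z' (single random-intercept term)."""
    n = len(y)
    V = sig2e * np.eye(n) + sig2b * Z @ Z.T
    r = y - f_X                              # residual after the fixed part f(X)
    _, logdet = np.linalg.slogdet(V)         # stable log-determinant of V
    quad = r @ np.linalg.solve(V, r)         # r' V^{-1} r without forming V^{-1}
    return 0.5 * (logdet + quad + n * np.log(2 * np.pi))

# Toy usage: q categories, one-hot design matrix Z, constant fixed part
rng = np.random.default_rng(0)
n, q = 50, 10
Z = np.eye(q)[rng.integers(0, q, n)]         # one-hot encoding of the categorical feature
b = rng.normal(0.0, 1.0, q)                  # true random effects
y = 2.0 + Z @ b + rng.normal(0.0, 0.5, n)
nll = lmm_nll(y, np.full(n, 2.0), Z, sig2e=0.25, sig2b=1.0)
```

In LMMNN this quantity replaces the usual squared-error loss, so the network learns f(X) while the variance components sig2e and sig2b are estimated jointly.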


Figure 2: Real data predicted vs. true results and category size distribution.

Reproducibility details:
- Environment: Python 3.8, the Numpy + Pandas suite, Keras and Tensorflow. Code is fully available in the lmmnn package on Github; for running the code, see the package README file.
- Simulations: n = 100,000. At each run, 80% of the simulated data (80,000 observations) is used as the training set, of which 10% (8,000) is used as a validation set that the network uses only to check for early stopping.
- Embedding baseline: an embedding layer maps the q levels to a vector of dimension d = 0.1q, so the input dimension is p + d.
- Physical activity (PA) definition: subjects wore an accelerometer on their wrist for 7 days; ENMO in milli-g was summarised across valid wear-time.
- ETL: we follow the instructions of Pearce et al. (2020), implemented in R. At a high level, "once a week" is converted to 1 and "every day" is converted to 7. Finally, the PA dependent variable is standardized.
- Baseline DNN architecture: Pearce et al. did not use DNNs, but two separate linear regressions, for men and women. Our baseline uses two hidden layers with ReLU activation of 10 and 5 neurons, followed by a single output neuron with no activation.
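The embedding baseline described above maps each of the q category levels to a learned d = 0.1q vector, which is then concatenated with the p numeric features. A minimal NumPy sketch of that input construction (a random lookup table stands in for a trained Keras Embedding layer; shapes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
q = 1000                      # number of category levels (high cardinality)
d = int(0.1 * q)              # embedding dimension, d = 0.1q as in the paper
p = 5                         # number of numeric features

emb_table = rng.normal(0.0, 0.01, size=(q, d))  # stand-in for a learned embedding matrix

x_cat = np.array([3, 17, 999])                  # category index per observation
x_num = rng.standard_normal((3, p))             # numeric features per observation

# Concatenate numeric features with the looked-up embedding rows:
features = np.concatenate([x_num, emb_table[x_cat]], axis=1)  # shape (3, p + d)
```

The resulting (p + d)-dimensional rows are what the downstream dense layers consume; when q is very high, this table alone costs q * d parameters, which is one of the scaling concerns the paper raises.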
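The baseline DNN architecture above (two ReLU hidden layers of 10 and 5 neurons, then a single linear output neuron) can be sketched as a plain NumPy forward pass; this is only an illustration of the layer shapes, with randomly initialized (untrained) weights and a hypothetical input dimension:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 8  # assumed input dimension for illustration

# Layer shapes match the described architecture: p -> 10 -> 5 -> 1
W1, b1 = rng.standard_normal((p, 10)) * 0.1, np.zeros(10)
W2, b2 = rng.standard_normal((10, 5)) * 0.1, np.zeros(5)
W3, b3 = rng.standard_normal((5, 1)) * 0.1, np.zeros(1)

def forward(x):
    h = np.maximum(x @ W1 + b1, 0.0)  # hidden layer 1: 10 neurons, ReLU
    h = np.maximum(h @ W2 + b2, 0.0)  # hidden layer 2: 5 neurons, ReLU
    return h @ W3 + b3                # output neuron: linear (no activation)

y_hat = forward(rng.standard_normal((4, p)))  # predictions for 4 observations
```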

